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Abstract 

We  consider  regularized  least-squares  (RLS)  with  a  Gaussian  kernel.  We 
prove  that  if  we  let  the  Gaussian  bandwidth  a  — >  oo  while  letting  the 
regularization  parameter  A  — >  0,  the  RLS  solution  tends  to  a  polynomial 
whose  order  is  controlled  by  the  relative  rates  of  decay  of  \  and  A:  if 
A  =  <7~(2k+1\  then,  as  cr  — *  oo,  the  RLS  solution  tends  to  the  fcth  order 
polynomial  with  minimal  empirical  error.  We  illustrate  the  result  with  an 
example. 


1  Introduction 

Given  a  data  set  ( X2 ,  yi),  •  •  • ,  ( xn ,  yn),  the  inductive  learning  task  is  to  build  a 

function  f(x)  that,  given  a  new  x  point,  can  predict  the  associated  y  value.  We  study  the 
Regularized  Least-Squares  (RLS)  algorithm  for  finding  /,  a  common  and  popular  algo¬ 
rithm  [2, 4]  that  can  be  used  for  either  regression  or  classification: 

1  " 

min  -  Ou)  “  Vi)2  +  AII/IIa- 

fen  n  ' 

Here,  H  is  a  Reproducing  Kernel  Hilbert  Space  (RKHS)  [1]  with  associated  kernel  function 
K ,  | / 1 1 is  the  squared  norm  in  the  RKHS,  and  A  is  a  regularization  constant  controlling 
the  tradeoff  between  fitting  the  training  set  accurately  and  forcing  smoothness  of  /. 
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RLSC  Results  for  GALAXY  Dataset 


Fig.  1.  RLS  classification  accuracy  results  for  the  UCI  Galaxy  dataset  over  a  range  of  a  (along  the 
®-axis)  and  A  (different  lines)  values.  The  vertical  labelled  lines  show  m,  the  smallest  entry  in  the 
kernel  matrix  for  a  given  a.  We  see  that  when  A  =  le  —  11.  we  can  classify  quite  accurately  when 
the  smallest  entry  of  the  kernel  matrix  is  .99999. 


The  Representer  Theorem  [6]  proves  that  the  RLS  solution  will  have  the  form 

n 

/0)  =  y ^CjK(xi,x), 

i=  1 

and  it  is  easy  to  show  [4]  that  we  can  find  the  coefficients  c  by  solving  the  linear  system 

(K  +  \nI)c=y,  (1) 

where  K  is  the  n  by  n  matrix  satisfying  Kl:j  =  K(xi ,  Xj). 

We  focus  on  the  Gaussian  kernel  K[xi ,  Xj )  =  exp(—  ||xj  —  Xj\ \2 /2a2). 

Our  work  was  originally  motivated  by  the  empirical  observation  that  on  a  range  of  bench¬ 
mark  classification  tasks,  we  achieved  surprisingly  accurate  classification  using  a  Gaussian 
kernel  with  a  very  large  a  and  a  very  small  A  (Figure  1;  additional  examples  in  [5]).  This 
prompted  us  to  study  the  large-er  asymptotics  of  RLS.  As  cr  — >  oo,  K(xt.  x:l)  — >  1  for 
arbitrary  Xi  and  Xj.  Consider  a  single  test  point  Xq.  RLS  will  first  find  c  using  Equation  1, 
then  compute 

/  Oo)  =  clk 

where  k  is  the  kernel  vector,  kj  =  K(xi,  Xq).  Combining  the  training  and  testing  steps,  we 
see  that 

/Oo)  =  y\K  +  A  nl)~1k 

Both  I\  and  k  are  close  to  1  for  large  cr,  i.e.  /\  ()  =  1  +  e,;;j  and  A:,  =  1  +  e,  .  If  we  directly 
compute  c  =  ( K  +  Xnl)^1y,  we  will  tend  to  wash  out  the  effects  of  the  e,j  term  as  a 


becomes  large.  If,  instead,  we  compute  f(x o)  by  associating  to  the  right,  first  computing 
point  affinities  ( K  +  Anl)~1k ,  then  the  el3  and  interact  meaningfully;  this  interaction  is 
crucial  to  our  analysis. 

Our  approach  is  to  Taylor  expand  the  kernel  elements  (and  thus  K  and  k)  in  1/a,  noting 
that  as  a  — >  oo,  consecutive  terms  in  the  expansion  differ  enormously.  In  computing  ( K  + 
A nl)~1k,  these  scalings  cancel  each  other  out,  and  result  in  finite  point  affinities  even  as 
er  — >  oo.  The  asymptotic  affinity  formula  can  then  be  “transposed”  to  create  an  alternate 
expression  for  f(x o).  Our  main  result  is  that  if  we  set  a2  =  s2  and  A  =  s“(2fe+1),  then,  as 
s  — »  oo,  the  RLS  solution  tends  to  the  fcth  order  polynomial  with  minimal  empirical  error. 

We  note  in  passing  that  our  work  is  somewhat  in  the  same  vein  as  the  elegant  recent  work 
of  Keerthi  and  Lin  [3];  they  consider  Support  Vector  Machines  rather  than  RLS,  and  derive 
only  the  linear  (first  order)  result. 

2  Notation  and  definitions 

Definition  1.  Let  xr  be  a  set  of  n  +  1  points  (0  <  i  <  n)  in  a  d  dimensional  space.  The 
scalar  Xia  denotes  the  value  of  the  ath  vector  component  of  the  ith  point. 

The  n  x  d  matrix,  X  is  given  by  Xla  =  Xia. 

We  think  of  X  as  the  matrix  of  training  data  Xi, ...  ,x„  and  Xo  as  an  1  x  d  matrix  consisting 
of  the  test  point. 

Let  lm,  1  im  denote  the  to  dimensional  vector  and  l  x  m  matrix  with  components  all  1, 
similarly  for  Om,  0/m  .  We  will  dispense  with  such  subscripts  when  the  dimensions  are  clear 
from  context. 

Definition  2  (Hadamard  products  and  powers).  For  two  l  x  to  matrices,  N,  M,  N  ©  M 

denotes  the  l  x  to  matrix  given  by  (TV  ©  M)ij  =  A©  ALir  Analogously,  we  set  ( N ®c)i  -  = 


Definition  3  (polynomials  in  the  data).  Let  I  £  Z>0  (non-negative  multi-indices)  and 
Y  be  a  k  X  d  matrix.  Y1  is  the  k  dimensional  vector  given  by  (Y1/.  =  Jla-i  Y-f.  If 
h  :  — »  R  then  h(Y)  is  the  k  dimensional  vector  given  by  ( h(Y))i  =  h(Yn, . . . ,  Yif). 

The  d  canonical  vectors,  ea  £  are  given  by  (ea)f,  =  Sab. 


For  example,  Xkea  is  the  ath  column  of  X  raised,  elementwise,  to  the  kth  power  and, 
similarly,  XQBa  =  x^a.  The  degree  of  the  multi-index  /  is  |/|  =  5Zf!=i  -  The  vector  h(Y) 
where  h(y)  =  Va 's  referred  to  as  ||V||2. 

In  constrast,  any  scalar  function,  /  :  R  — >  R,  applied  to  any  matrix  or  vector.  A,  will  be  as¬ 
sumed  to  denote  the  elementwise  application  of  /.  We  will  treat  y  ev  as  a  scalar  function 
(we  have  no  need  of  matrix  exponentials  in  this  work,  so  the  notation  is  unambiguous). 


We  can  re-express  the  kernel  matrix  and  kernel  vector  in  this  notation: 

K  =  e^T:Uix°«(x^y-x^ii-in(x^y 


diag  ^ 


lA'I 


T  XX 


diag  [ 


k  =  2X»»x5“-Xa'«l ,-Ux20‘ 


diag  ^ 2 


I  A'll 


5  _l2  Axt  L  1 1*0  I 


(2) 

(3) 

(4) 

(5) 


3  Orthogonal  polynomial  bases 


Let  Vc  =  span{AT7  :  |  J|  =  c}  and  V<c  =  Ua=o  ^ c  which  can  be  thought  of  as  the  set  of  all 
d  variable  polynomials  of  degree  c,  evaluated  on  the  training  data.  Since  the  data  are  finite, 
there  exists  b  such  that  V<c  =  V<  /,  for  all  c  >  b.  Generically,  b  is  the  smallest  c  such  that 


c  +  d 
d 


>  n. 


Let  Q  be  an  orthonormal  matrix  in  R”  x "  whose  columns  progressively  span  the  V<c 
spaces,  i.e.  Q  =  (Bo  B\  •  •  •  B^)  where  QfQ  =  I  and  colspan{(  Bo  ■  ■  ■  Bc )}  = 

V<c.  We  might  imagine  building  such  a  Q  via  the  Gramm-Schmidt  process  on  the  vectors 
X° ,  Xei , ,  Xed, . . .  X1 , . . .  taken  in  order  of  non-decreasing  |/|. 


Letting  C t  = 


\I\ 

h  ■  ■  -Id 


be  multinomial  coefficients,  the  following  relations  between 


Q,  X,  and  Xq  are  easily  proved. 


(X;4)0C  =  Y  CiX^xjyf  hence  ( Xx f>)0c  £  Vc 

|/|=c 

(XX*)00  =  Y  CiX^X1)1  hence  colspan{(JAX*)0c}  =  Vc 

|/|=c 


and  thus,  B\(X Xq)@c  =  0  if  i  >  c,  B\(X  X*)®0  B j  =  0  if  i  >  c  or  j  >  c,  and 
Btc(XXt)&cBc  is  non-singular. 

Finally,  we  note  that  argmin„eV^{||i/ —  u||}  =  Ea<c  BaiKv)- 


4  Taking  the  a  —>  oo  limit 

We  will  begin  with  a  few  simple  lemmas  about  the  limiting  solutions  of  linear  systems. 
At  the  end  of  this  section  we  will  arrive  at  the  limiting  form  of  suitably  modified  RLSC 
equations. 


Lemma  1.  Let  A(s)  be  a  continuous  matrix-valued  function  defined  for  0  <  s  <  So  for 
some  sq  £  R.  //lims^o  A(s)  =  Ao  and  A$  is  non-singular,  then  lim^o  A(s)^1  =  Af  • 


Proof  Given  e,  select  S  <  s0  such  that  || I  -  A(s)A0  <  min  ||,  2||  j  for  s  <  6 

(such  a  S  exists  since  lims^o  A(s)  =  Ao).  Note  that  ||I  —  A(s)A^  ||2  <  implies  A(s) 
is  non-singular.  Then 


A(s)"1  =  Af\l  -  ( I  -  A(s)A^))-'  =  A, E  J  +  YV  ~  A(*)^  1)i 


i>  1 


||A0-1^A(s)-1||2<||A0-1||2  117  A(s)A°”112 


1-||/-A(s)A0-1||2 


<  e. 


Corollary  1.  Let  A(s),  y(s)  be  continuous  matrix-valued  and  vector-valued  functions,  de¬ 
fined  for  0  <  s  <  Sq  for  some  So  £  R  with  lims^0  A(s)  =  A0  non-singular. 
lims^0  y{s)  =  y0  iff  lims^0  ^(s)_12/(s)  =  A~1y0. 


Proof.  By  lemma  1,  lims^o  A(s)  1  =  A0  . 

By  the  continuity  of  matrix  multiplication 

lim  B(s)x(s )  =  ( lim  B(s) )  ( lim  x(s) ) 
s—>0  Vs— ^0  /  Vs— >0  / 

(the  existence  of  the  right  hand  limits  implying  the  existence  of  the  left  hand  limit). 

If  lims^0  y(s)  =  Uo  then  let  B(s)  =  A_1(s)  and  x(s)  =  y(x). 

If  lims^o  A(s)_1y(s)  =  Xq  then  let  x(s)  =  A(s)~1y(s )  and  B(s )  =  A(s),  and  thus 
y0  =  lims^0  A(s)(A(s)^1y(s))  =  A0x0.  □ 

Lemma  2.  Let  A(s) ,  y (s)  be  matrix-valued  and  vector-valued  polynomials  of  degree  p  and 
B(s),z(s)  be  matrix-valued  and  vector-valued  functions  that  are  bounded  in  the  region 
0  <  s  <  Sq,  for  some  s0  G  R.  If  A(s)  is  non-singular  for  0  <  s  <  s0,  then 

lim(A(a)  +  sp+1B(s))-\y(s)  +  sp+1z{s))  =  lim  Afa^yfs). 

s—>0  s—>0 


Proof.  We  first  note  that  for  s  >  0, 

04(a)  +  sp+1B(s))-1  =  (/  +  sp+1  A(s)-1B(s))-1A(s)-1 

Since  A(a)  is  a  polynomial,  the  entries  of  A(s)_1  are  rational  functions  with  denominators 
of  degree  p.  Thus,  lims^o  sp+1A_1(s)  =  0,  and  thus,  by  the  boundedness  of  B(s)  and 

z(s)> 

sp+1A-1(s)0(s)  -►  0 
sp+1A-1(s)B(s)  ->  0. 

By  Lemma  1,  lims^o(^  +  sp+1A_1(s).B(,s))  =  I.  Thus,  by  Corollary  1, 
lim  (A(s)  +  sp+1  B(s))~1  (y(s)  +  sp+1z(s)) 

s—tO 

=  lim  (7  +  sp+1A(s)_1.B(s))_1A(s)_1(y(s)  +  sp+12(s)) 

s— ►  0 

=  lim  A(s)_1(y(s)  +  sp+1z(s)) 

s—tO 

=  lim  A(s)_1y(s). 

s— ►  0 

□ 


Lemma  3.  Let  i !<•••<  iq  be  positive  integers.  Let  A(s),  y(s)  be  a  block  matrix  and 
block  vector  given  by 


A(s)  = 

/  A0o(s) 
s*1  Aio(s) 

s*1  A0i(s)  •• 
s*1  An(s) 

■■  siqA0q(s)  \ 
■■  siqAlq(s)  1 

II 

(  bo(s)  \ 
sllb1(s) 

\a*9Ag0(a) 

sil>Aql(s)  ■■ 

•  slqAqq{s) ) 

i  i 

\  siqbq(s) ) 

where  Aiq  (s)  and  b^s)  are  continuous  matrix-valued  and  vector-valued  functions  of  s  with 
A;i(0)  non-singular  for  all  i. 


lim  A  1(s)y(s) 

s— >o 


Aoo(0)  0 

'•  0  \  _1  /  ^o(0) 

Aio(0)  An(0) 

0 

6t(0) 

Ago(0)  A9l(0)  •• 

KM) 

W  0) 

those  of  A(s). 


lqI)  with  the  blocks  of  P(s)  commensurate  with 

'  AoO) 

s?M0iO) 

siqA0q(s)  \ 

AoO) 

AtO) 

■  siq~^Alq(s) 

,  AoO) 

AiO) 

Aqq(s)  ) 

and 


s— >0 


which  is  invertible. 


lim  P(s)A(s)  =  I  Alo(0)  All(0) 


'  Aio(o) 

Ao(0) 

,Aq  q(0)  Agl(0) 


0 
0 


AO) ’ 


Noting  that  li 


lims^o  P{s)y(s)  -  bl^  , 


we  see  that  our  result  follows  from  corollary  1 


AO), 


applied  to  lims^o(-POMO))  1(P(s)y(s))-  □ 

We  are  now  ready  to  state  and  prove  the  main  result  of  this  section,  characterizing  the 
limiting  large-er  solution  of  Gaussian  RLS. 

Theorem  1.  Let  q  be  an  integer  satisfying  q  <  b,  and  let  p  =  2q  +  1.  Let  A  =  Ca~p  for 
some  constant  C.  Define  A^  =  ^ f?|(AiXt)0ci3j,  and  b ^  ^ B |  (Xx^)00. 

lim  ( K  +  nCcr~pI )  1  k  =  v 


where 


v  =  (B0  ■■■  Bq)w  (6) 


(C) 

(4? 

0 

0 

\ 

f 1 

= 

,<U 

^10 

A)  • 
Au 

0 

w) 

U? 

/lO)  . 

1 

•  •  A ^ 

Aqq 

) 

We  first  manipulate  the  equation  ( K  +  n\I)y  =  k  according  to  the  factorizations  in  (3) 
and  (5).  Defining 

N  =  diag  e~  ,  a  =  e~2^^x°^  , 

P  =  e^xx\  w  =  e^Xx»,  /3  =  nCa~p, 

(where  we  omit  for  brevity  the  dependencies  on  a)  we  have 

I<  =  diag  (e_;A||x|12)  e^XA'diag  (e~^x^  =  NPN 
k  =  diag  j  =  Nwa 


lim  e  SeAx°^  diag  HA  H  \  =  lim  aN  1  = /, 


Noting  that 


we  have 


v  =  lim  ( K  +  nCa  PI)  1k 

<J—*  OO 

=  lim  ( NPN  +  /?/)_1  Nwa 

a — »oo 

=  lim  aN~1{P  +  /3N^2)^1w 


=  lim  aN  1(P  +  f3N  2)  1w 


lim  (e^xxt  +  nCa~pdiag 


-1 


Changing  bases  with  Q , 

Q*v  =  jin^  (Qle^xxtQ  +  nCa~pQtd iag  (e^11*1'2)  q)  1  Qte^Xx°. 

Expanding  via  Taylor  series  and  writing  in  block  form  (in  the  b  x  b  block  structure  of  Q), 

QfSxxtQ  =  Qt{XXt)Q0Q  +  -±-Q\xXtf>1Q  +  1^—Qt(XXt)Q2Q  +  ■■■ 

1  \oA  zla 4 


/4o 

0  • 

••  0  \ 

1 

H — x 

/  A{1) 

1  ^00 

4(1)  .. 

^01 

•  0 

\ 

0 

0  • 

..  o' 

4(1) 

^10 

Aw  •• 

•  0 

1 

. 1 

a2 

V  0 

0  • 

■■  0  / 

l  ’o' 

0 

•  0 

) 

Qte^Xx±0  =  Qt(Xxt0)Q0  +  \q*(Xx  £)01  +  -^P^o)02  +  •  •  • 


\  1 

H - 2 

G 

(W\ 

V  0  > 

V  o  / 

nCa  PQ* diag  (e  »2  A  ^  Q  =  nCa  PI  +  ■  ■  ■ . 

Since  the  A{f  are  non-singular.  Lemma  3  applies,  giving  our  result.  □ 


5  The  classification  function 

When  performing  RLS,  the  actual  prediction  of  the  limiting  classifier  is  given  via 

foo(xo)=  lim  yt(K  +  nCa~pI)~1k. 

a — »oo 

Theorem  1  determines 

v  =  lim  (K  +  nCa~pI)~1k, 

(7 — »00 

showing  that  foo(xo)  is  a  polynomial  in  the  training  data  X.  In  this  section,  we  show 
that  foo(xo)  is,  in  fact,  a  polynomial  in  the  test  point  Xo-  We  continue  to  work  with  the 
orthonormal  vectors  Bi  as  well  as  the  auxilliary  quantities  A^-  '1  and  Ijf1  from  Theorem  1. 


Theorem  1  shows  that  v  £  V<q:  the  point  affinity  function  is  a  polynomial  of  degree  q  in 
the  training  data,  determined  by  (7). 


Y  dBiA^B]  =  (IX‘)0C 

i,j<c 

YdB^c)  =  (X4)0C 

i<c 


hence 


hence 


YdBcAgBt  =  BCB‘(IX()0C 

j<c 

<ABcb\c)  =  SCB‘(X4)0C 


we  can  restate  Equation  7  in  an  equivalent  form: 


B*\  1 

(0!A$ 

0 

0  \ 

-°0  \ 

1! 

_ 

1!A% 

0 

BtJ 

\q'4q) J 

q'Ai  ■ 

"  qij&J 

Bl' 


B\, 


=  0 


J 


Yc'-Bcb{c 

c<q 


-EE  c\BcA^j  BjV  =  0 

C<<?  J<C 


Y  BcBfc  ((X4)0C  -  (IIf)0c«)  =  0. 

c<g 


(8) 

(9) 

(10) 


Up  to  this  point,  our  results  hold  for  arbitrary  training  data  X.  To  proceed,  we  require  a 
mild  condition  on  our  training  set. 

Definition  4.  X  is  called  generic  if  X11 , . . . ,  XIn  are  linearly  independent  for  any  distinct 
multi-indices  {/;}. 


Lemma  4.  For  generic  X,  the  solution  to  Equation  7  (or  equivalently,  Equation  10)  is 
determined  by  the  conditions 


V/  :  \I\  <  q,  (X1)^  =  xl 


(11) 


where  v  £  V<q. 


Proof  By  definition,  V<q  =  span  {A' /  :  |/|  <  q]  and,  by  genericity,  the  vectors  X 1  where 
/ 1  <  q  <  b  are  linearly  independent.  Thus  (11)  reduces  to  a 


q  +  d 
d 


q  +  d 
d 


system 


of  linear  equations  with  unique  solution,  which  we  will  call  v.  We  now  show  that  v  satisfies 
(10). 

(XXt)©c  =  ^2  CrX^X1)*  and  (Ax*)0c  =  Y  cixI{4)* 

\I\=c  |/|=c 

Y  CIXI(XI)tv  =  Y  crXTxT0. 

|/|=c  \I\=C 

and  thus  (XXt)Qcv  =  (Xip)0c-  □ 


Theorem  2.  For  generic  data,  let  v  be  the  solution  to  Equation  10.  For  any  y  £  R", 
f(x  0)  =  ytv  =  h(x  o),  where  h(x)  =  Xqji<g  aix1  is  a  multivariate  polynomial  of  degree 
q  minimizing  \\y  —  /i(X)||. 

Proof  Since  h(X)  is  the  minimizer  of  | \y  —  h(X)  |  |, 

h(X)  =  (B0  •••  Bq)(B0  •••  Bqfy. 


Thus, 

h(X)*v  =  yt(B0  ■■■  Bq)(B0  ■■■  Bq)tv  =  ytv 

since  v  £  V<q. 

By  Lemma  5, 

h{ X^v  =  ai{XI)tv  =  cliXq  =  h(x o). 

\i\<q  W<q 


□ 


We  see  that  as  a  — >  oo,  the  RLS  solution  tends  to  the  minimum  empirical  error  fcth  order 
polynomial. 


6  Experimental  Verification 

In  this  section,  we  present  a  simple  experiment  that  illustrates  our  results.  We  consider  the 
fifth-degree  polynomial  function 

/(x)  =  .5(1  —  x)  +  150a;(x  —  .25)(x  —  .3)(x  —  .75)(x  —  .95), 

over  the  range  x  £  [0,1].  Figure  2  plots  /,  along  with  a  150  point  dataset  drawn  by  choosing 
Xi  uniformly  in  [0,1],  and  choosing  y  =  /(x)  +  e,,  where  e,  is  a  Gaussian  random  variable 
with  mean  0  and  standard  deviation  .05.  Figure  2  also  shows  (in  red)  the  best  polynomial 
approximations  to  the  data  (not  to  the  ideal  /)  of  various  orders.  (We  omit  third  order 
because  it  is  nearly  indistinguishable  from  second  order.) 


f(x),  Random  Sample  of  f(x),  and  Polynomial  Approximations 


Fig.  2.  f(x )  =  .5(1  —  x)  +  150x(x  —  .25)  (x  —  .3)  (x—  .75)  (x  —  .95),  a  random  dataset  drawn  from 
f(x)  with  added  Gaussian  noise,  and  data-based  polynomial  approximations  to  /. 


According  to  Corollary  1,  if  we  parametrize  our  system  by  a  variable  s,  and  solve  a  Gaus¬ 
sian  regularized  least  squares  problem  with  a2  =  s 2  and  A  =  Cs~<'2k+1'1  for  some  integer 


k,  then,  as  s  — >  oo,  we  expect  the  solution  to  the  system  to  tend  to  the  fcth-order  data- 
based  polynomial  approximation  to  /.  Asymptotically,  the  value  of  the  constant  C  does 
not  matter,  so  we  (arbitrarily)  set  it  to  be  1.  Figure  3  demonstrates  this  result. 

We  note  that  these  experiments  frequently  require  setting  A  much  smaller  than  machine  - 
e.  As  a  consequence,  we  need  more  precision  than  IEEE  double -precision  floating-point, 
and  our  results  cannot  be  obtained  via  many  standard  tools  (e.g.,  MATLAB(TM))  We  per¬ 
formed  our  experiments  using  CLISP,  an  implementation  of  Common  Lisp  that  includes 
arithmetic  operations  on  arbitrary-precision  floating  point  numbers. 


Oth  order  solution,  and  successive  approximations. 


1st  order  solution,  and  successive  approximations. 


4th  order  solution,  and  successive  approximations. 


5th  order  solution,  and  successive  approximations. 


Fig.  3.  As  s  — >  oo,  cr2  =  s2  and  A  =  s  (2k+1\  the  solution  to  Gaussian  RLS  approaches  the  fcth 
order  polynomial  solution. 


7  Discussion 


Our  result  provides  insight  into  the  asymptotic  behavior  of  RLS,  and  (partially)  explains 
Figure  1:  in  conjunction  with  additional  experiments  not  reported  here,  we  believe  that 
we  are  recovering  second-order  polynomial  behavior,  with  the  drop-off  in  performance  at 
various  A’s  occurring  at  the  transition  to  third-order  behavior,  which  cannot  be  accurately 
recovered  in  IEEE  double -precision  floating-point.  Although  we  used  the  specific  details 
of  RLS  in  deriving  our  solution,  we  expect  that  in  practice,  a  similar  result  would  hold  for 
Support  Vector  Machines,  and  perhaps  for  Tikhonov  regularization  with  convex  loss  more 
generally. 

An  interesting  implication  of  our  theorem  is  that  for  very  large  a,  we  can  obtain  various 
order  polynomial  classifications  by  sweeping  A.  In  [5],  we  present  an  algorithm  for  solving 
for  a  wide  range  of  A  for  essentially  the  same  cost  as  using  a  single  A.  This  algorithm  is  not 
currently  practical  for  large  a,  due  to  the  need  for  extended-precision  floating  point. 

Our  work  also  has  implications  for  approximations  to  the  Gaussian  kernel.  Yang  et  al.  use 
the  Fast  Gauss  Transform  (FGT)  to  speed  up  matrix-vector  multiplications  when  perform¬ 
ing  RLS  [7],  In  [5],  we  studied  this  work;  we  found  that  while  Yang  et  al.  used  moderate-to- 
small  values  of  a  (and  did  not  tune  A),  the  FGT  sacrificed  substantial  accuracy  compared 
to  the  best  achievable  results  on  their  datasets.  We  showed  empirically  that  the  FGT  be¬ 
comes  much  more  accurate  at  larger  values  of  a;  however,  at  large-cr,  it  seems  likely  we 
are  merely  recovering  low-order  polynomial  behavior.  We  suggest  that  approximations  to 
the  Gaussian  kernel  must  be  checked  carefully,  to  show  that  they  produce  sufficiently  good 
results  are  moderate  values  of  cr;  this  is  a  topic  for  future  work. 


References 

1.  Aronszain.  Theory  of  reproducing  kernels.  Transactions  of  the  American  Mathematical  Society, 
68:337-404,  1950. 

2.  Evgeniou,  Pontil,  and  Poggio.  Regularization  networks  and  support  vector  machines.  Advances 
In  Computational  Mathematics,  13(1):  1 — 50,  2000. 

3.  Keerthi  and  Lin.  Asymptotic  behaviors  of  support  vector  machines  with  gaussian  kernel.  Neural 
Computation,  15(7):  1667—1689,  2003. 

4.  Rifkin.  Everything  Old  Is  New  Again:  A  Fresh  Look  at  Historical  Approaches  to  Machine  Learn¬ 
ing.  PhD  thesis,  Massachusetts  Institute  of  Technology,  2002. 

5.  Rifkin  and  Lippert.  Practical  regularized  least-squares:  A-selection  and  fast  leave-one-out- 
computation.  In  preparation,  2005. 

6.  Wahba.  Spline  Models  for  Observational  Data,  volume  59  of  CBMS-NSF  Regional  Conference 
Series  in  Applied  Mathematics.  Society  for  Industrial  &  Applied  Mathematics,  1990. 

7.  Yang,  Duraiswami,  and  Davis.  Efficient  kernel  machines  using  the  improved  fast  Gauss  transform. 
In  Advances  in  Neural  Information  Processing  Systems,  volume  16,  2004. 


